Abstract

We address the problem of nonparametric estimation of characteristics for
stationary and ergodic time series. We consider finite-alphabet time series
and real-valued ones and the following four problems: i) estimation of the
(limiting) probability $P(x_0 \ldots x_s)$ for every $s$ and each sequence
$x_0 \ldots x_s$ of letters from the process alphabet (or estimation of the
density $p(x_0, \ldots, x_s)$ for real-valued time series), ii) so-called
on-line prediction, where the conditional probability
$P(x_{t+1} | x_1 x_2 \ldots x_t)$ (or the conditional density
$p(x_{t+1} | x_1 x_2 \ldots x_t)$) should be estimated (in the case where
$x_1 x_2 \ldots x_t$ is known), iii) regression and iv) classification (or
so-called problems with side information).

We show that so-called archivers (or data compressors) can be used as a
tool for solving these problems. First, we prove that any so-called universal
code (or universal data compressor) can be used as a basis for constructing
asymptotically optimal methods for the above problems. (By definition, a
universal code can "compress" any sequence generated by a stationary and
ergodic source asymptotically down to the Shannon entropy of the source.)
Second, we show experimentally that estimates based on practically used
methods of data compression have reasonable precision.

AMS subject classification: 60G10, 60J10, 62G07, 62G08, 62M20, 94A29. 

keywords: time series, nonparametric estimation, prediction, universal coding,
data compression, on-line prediction, Shannon entropy, stationary and ergodic
process, regression.



1 Introduction 

We consider a stationary and ergodic source, which generates sequences
$x_1 x_2 \ldots$ of elements (letters) from some set (alphabet) $A$, which is
either finite or real-valued. It is supposed that the probability distribution
(or distribution of limiting probabilities)
$P(x_1 = a_{i_1}, x_2 = a_{i_2}, \ldots, x_t = a_{i_t})$ (or the density
$p(x_1, x_2, \ldots, x_t)$) is unknown, but we are given either one sample
$x_1 \ldots x_t$ or several ($r$) non-overlapping samples
$x^1 = x^1_1 \ldots x^1_{t_1}, \ldots, x^r = x^r_1 \ldots x^r_{t_r}$ generated
by the source. (Here non-overlapping means that the sequences either are parts
of different realizations or belong to non-overlapping parts of one
realization, say, a realization with gaps. Generally speaking, they cannot be
combined into one sample for a stationary and ergodic source, as can be done
for an i.i.d. one.)

Of course, anyone who knows the probability distribution (or the density) has
all information about the source and can solve all problems in the best way.
Hence, generally speaking, precise estimates of the probability distribution
and the density can be used for prediction, regression estimation, etc. In
this paper we follow this scheme: we first consider the problem of estimating
the probability distribution or the density. Then we show how the solution can
be applied to other problems, paying most attention to the problem of
prediction, because of its practical applications and its importance for
probability theory, information theory, statistics and
other theoretical sciences, see[Tl[Tll[l5l[I7l[2ll[2Sl[271llT]. 

We show that universal codes (or data compressors) can be applied directly to
the problems of estimation, prediction, regression and classification. This is
not surprising, because for any stationary and ergodic source $p$ generating
letters from a finite alphabet and any universal code $U$ the following
equality is valid with probability 1:

$$\lim_{t \to \infty} \frac{1}{t} \left( -\log p(x_1 \cdots x_t) - |U(x_1 \cdots x_t)| \right) = 0,$$

where $x_1 \cdots x_t$ is generated by $p$. (Here and below $\log = \log_2$,
and $|v|$ is the length of $v$ if $v$ is a word and the number of elements of
$v$ if $v$ is a set.) So, in fact, the length of the universal code
$|U(x_1 \cdots x_t)|$ can be used as an estimate of minus the logarithm of the
unknown probability and, obviously, $2^{-|U(x_1 \cdots x_t)|}$ can be
considered as an estimate of $p(x_1 \cdots x_t)$. In fact, a universal code
can be viewed as a nonparametric estimate of (limiting) probabilities for
stationary and ergodic sources. This was recognized shortly after the
discovery of universal codes (for the set of stationary and ergodic processes
with finite alphabets [22]), and universal codes were applied to solving the
prediction problem [30].
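The idea that a code length serves as an estimate of $-\log p$ can be sketched with an off-the-shelf compressor. The snippet below is only an illustration under stated assumptions: `zlib` stands in for the universal code $U$ (it is a practical archiver, not a code proven universal for all stationary ergodic sources, and its output includes some fixed header overhead), and the helper names are hypothetical.

```python
import random
import zlib

def code_length_bits(data: bytes) -> int:
    """Length, in bits, of the zlib-compressed form of `data` (stands in for |U(x)|)."""
    return 8 * len(zlib.compress(data, 9))

def log2_probability_estimate(data: bytes) -> int:
    """Estimate of log2 p(x_1...x_t), taken as minus the code length."""
    return -code_length_bits(data)

rng = random.Random(0)                                      # fixed seed, illustration only
regular = b"01" * 2000                                      # strongly regular sequence
irregular = bytes(rng.choice(b"01") for _ in range(4000))   # coin-flip sequence

for name, seq in (("regular", regular), ("irregular", irregular)):
    # Bits per letter play the role of an entropy estimate for the sequence.
    print(name, code_length_bits(seq) / len(seq), "bits per letter")
```

The regular sequence receives a much shorter code, i.e. a much larger estimated probability, than the coin-flip sequence of the same length.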

We would like to emphasize that, on the one hand, all results are obtained in 
the framework of classical probability theory and mathematical statistics and, on the 
other hand, everyday methods of data compression (or archivers) can be used as a 
tool for density estimation, prediction and other problems, because they are practical 
realizations of universal codes. It is worth noting that modern data compressors
(like zip, arj, rar, etc.) are based on deep theoretical results of the theory of source
coding (see, e.g., [10], [HI [22], [27], [37]) and have demonstrated high efficiency in
practice as compressors of texts, DNA sequences and many other types of real data.
In fact, archivers can find many kinds of latent regularities, that is why they look 
like a promising tool for prediction and other problems. Moreover, recently universal 
codes and archivers were efficiently applied to some problems which are very far from 
data compression: first, their applications in |H|5] created a new and rapidly growing 
line of investigation in clustering and classification and, second, universal codes were 
used as a basis for non-parametric tests for the main statistical hypotheses concerned 
with stationary and ergodic time series [331 Elj-

The outline of the paper is as follows. Section 2 contains a description of the
Laplace predictor and its generalizations, a review of known results and a
description of one universal code. Sections 3 and 4 are devoted to processes
with finite and real-valued alphabets, respectively. The last part contains
some examples and simulations.

2 Predictors and universal data compressors 

2.1 The Laplace measure and on-line prediction for i.i.d. 
processes 

We consider a source with unknown statistics which generates sequences
$x_1 x_2 \ldots$ of letters from some set (or alphabet) $A$. Let the source
generate a message $x_1 \ldots x_{t-1} x_t \ldots$, $x_i \in A$ for all $i$,
and suppose the next letter $x_{t+1}$ needs to be predicted.

It will be convenient at first to describe briefly the prediction problem. This 
problem can be traced back to Laplace [TT]. He considered the problem how to 
estimate the probability that the sun will rise tomorrow, given that it has risen 
every day since Creation. In our notation the alphabet $A$ contains two letters,
0 ("the sun rises") and 1 ("the sun does not rise"), $t$ is the number of days
since Creation, and $x_1 \ldots x_{t-1} x_t = 00 \ldots 0$.

Laplace suggested the following predictor:

$$L_0(a | x_1 \cdots x_t) = (\nu_{x_1 \cdots x_t}(a) + 1)/(t + |A|), \tag{1}$$

see [11], where $\nu_{x_1 \cdots x_t}(a)$ denotes the number of occurrences of
the letter $a$ in the word $x_1 \ldots x_{t-1} x_t$. For example, if
$A = \{0, 1\}$ and $x_1 \ldots x_5 = 01010$, then the Laplace prediction is as
follows: $L_0(x_6 = 0 | 01010) = (3 + 1)/(5 + 2) = 4/7$,
$L_0(x_6 = 1 | 01010) = (2 + 1)/(5 + 2) = 3/7$. In other words, 4/7 and 3/7
are estimates of the unknown probabilities
$P(x_{t+1} = 0 | x_1 \ldots x_t = 01010)$ and
$P(x_{t+1} = 1 | x_1 \ldots x_t = 01010)$.
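The Laplace predictor is easy to check numerically. The following minimal sketch (the function name and the string encoding of words are illustrative choices, not part of the original text) reproduces the two values of the example with exact rational arithmetic.

```python
from collections import Counter
from fractions import Fraction

def laplace_predictor(history: str, letter: str, alphabet: str = "01") -> Fraction:
    """L_0(letter | history) = (count of letter in history + 1) / (t + |A|)."""
    counts = Counter(history)
    return Fraction(counts[letter] + 1, len(history) + len(alphabet))

print(laplace_predictor("01010", "0"))  # 4/7
print(laplace_predictor("01010", "1"))  # 3/7
```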

We can see that Laplace considered prediction as a set of estimations of unknown 
(conditional) probabilities. This approach to the problem of prediction was developed 
in [30] and now is often called on-line prediction or universal prediction [H [HI [25] . 
As we mentioned above, it seems natural to consider conditional probabilities to be 
the best prediction, because they contain all information about the future behavior 
of the stochastic process. Moreover, this approach is deeply connected with the
game-theoretical interpretation of prediction (see [16], [32]) and, in fact,
all the results obtained can easily be transferred from one model to the other.

Any predictor $\gamma$ defines a measure by the following equation:

$$\gamma(x_1 \ldots x_t) = \prod_{i=1}^{t} \gamma(x_i | x_1 \ldots x_{i-1}). \tag{2}$$

For example, $L_0(0101) = \frac{1}{2} \cdot \frac{1}{3} \cdot \frac{1}{2} \cdot \frac{2}{5} = \frac{1}{30}$.
And, vice versa, any measure $\gamma$ (or estimate of the measure) defines a
predictor: $\gamma(x_i | x_1 \ldots x_{i-1}) = \gamma(x_1 \ldots x_{i-1} x_i)/\gamma(x_1 \ldots x_{i-1})$.
The same is true for a density (and its estimate): a predictor is defined by
the conditional density and, vice versa, the density is equal to the product
of conditional densities:

$$p(x_i | x_1 \ldots x_{i-1}) = p(x_1 \ldots x_{i-1} x_i)/p(x_1 \ldots x_{i-1}), \qquad p(x_1 \ldots x_t) = \prod_{i=1}^{t} p(x_i | x_1 \ldots x_{i-1}).$$
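The two directions of this correspondence (predictor to measure by the chain rule, measure to predictor by a ratio) can be sketched as follows. The helper names are hypothetical; the Laplace predictor is used only as a concrete instance.

```python
from fractions import Fraction

def laplace(history: str, letter: str, alphabet: str = "01") -> Fraction:
    """Laplace predictor L_0(letter | history)."""
    return Fraction(history.count(letter) + 1, len(history) + len(alphabet))

def measure(predictor, word: str) -> Fraction:
    """Measure of `word` as the product of one-step conditional predictions."""
    prob = Fraction(1)
    for i, letter in enumerate(word):
        prob *= predictor(word[:i], letter)
    return prob

def conditional(measure_fn, history: str, letter: str) -> Fraction:
    """Predictor recovered from a measure: gamma(history + letter) / gamma(history)."""
    return measure_fn(history + letter) / measure_fn(history)

print(measure(laplace, "0101"))                                   # 1/30
print(conditional(lambda w: measure(laplace, w), "01010", "0"))   # 4/7
```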

The next natural question is how to measure the precision of the prediction
and of an estimate of the probability. Mainly we will measure the error of
prediction by the Kullback-Leibler (KL) divergence between a distribution $p$
and its estimate. Consider an (unknown) source $p$ and some predictor
$\gamma$. The error is characterized by the KL divergence

$$\rho_{\gamma,p}(x_1 \cdots x_t) = \sum_{a \in A} p(a | x_1 \cdots x_t) \log \frac{p(a | x_1 \cdots x_t)}{\gamma(a | x_1 \cdots x_t)}. \tag{3}$$

It is well known that for any distributions $p$ and $\gamma$ the KL divergence
is nonnegative and equals 0 if and only if $p(a) = \gamma(a)$ for all $a$,
see, e.g., [13]. The following inequality (Pinsker's inequality)

$$\sum_{a \in A} P(a) \log \frac{P(a)}{Q(a)} \ge \frac{\log e}{2} \|P - Q\|^2 \tag{4}$$

connects the KL divergence with the so-called variation distance

$$\|P - Q\| = \sum_{a \in A} |P(a) - Q(a)|,$$

where $P$ and $Q$ are distributions over $A$, see [6]. For fixed $t$,
$\rho_{\gamma,p}(\cdot)$ is a random variable, because $x_1, x_2, \ldots, x_t$
are random variables. We define the average error at time $t$ by
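As a small numerical illustration of these two error measures, the sketch below computes the KL divergence (in bits) and the variation distance for a pair of distributions and checks Pinsker's inequality in its standard base-2 form, $D \ge (\log_2 e / 2)\|P - Q\|^2$. The distributions `p` and `q` are made-up values chosen only for the example; `q` happens to be the Laplace prediction after 01010.

```python
import math

def kl_divergence(p: dict, q: dict) -> float:
    """KL divergence sum_a p(a) log2(p(a)/q(a)); terms with p(a) = 0 contribute 0."""
    return sum(pa * math.log2(pa / q[a]) for a, pa in p.items() if pa > 0)

def variation_distance(p: dict, q: dict) -> float:
    """Variation distance sum_a |p(a) - q(a)|."""
    return sum(abs(p[a] - q[a]) for a in p)

p = {"0": 0.6, "1": 0.4}        # hypothetical "true" conditional distribution
q = {"0": 4 / 7, "1": 3 / 7}    # Laplace prediction after the word 01010

d = kl_divergence(p, q)
v = variation_distance(p, q)
print(d, v, d >= (math.log2(math.e) / 2) * v ** 2)
```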

$$\rho^t(p \| \gamma) = E(\rho_{\gamma,p}(\cdot)) = \sum_{x_1 \cdots x_t \in A^t} p(x_1 \cdots x_t)\, \rho_{\gamma,p}(x_1 \cdots x_t). \tag{5}$$

It is shown in [31] that the error of the Laplace predictor $L_0$ goes to 0
for any i.i.d. source $p$. More precisely, it is proven that

$$\rho^t(p \| L_0) < (|A| - 1)/(t + 1) \tag{6}$$

for any i.i.d. source $p$ ([31]; see also [35]). So, we can see from this
inequality that the average error of the Laplace predictor $L_0$ (measured
either by the KL divergence or by the variation distance) goes to zero for any
unknown i.i.d. source as the sample size $t$ grows. Moreover, it can easily be
shown that the error (and the corresponding variation distance) goes to zero
with probability 1 as $t$ goes to infinity. Obviously, such a property is very
desirable for any predictor and for larger classes of sources, like Markov,
stationary and ergodic, etc. However, it is proven in
[50] (see also [IlIIllES]) that such predictors do not exist for the class of all stationary 
and ergodic sources (generating letters from a given finite alphabet). More
precisely, for any predictor $\gamma$ there exist a source $p$ and
$\delta > 0$ such that with probability 1
$\rho_{\gamma,p}(x_1 \cdots x_t) > \delta$ infinitely often as
$t \to \infty$. So, the error of any predictor does not go to 0 if the
predictor is applied to all stationary and ergodic sources; that is why it is
difficult to use (3) and (5) for comparing different predictors.

On the other hand, it is shown in [30] that there exists a predictor $R$ such
that the Cesaro average $t^{-1} \sum_{i=1}^{t} \rho_{R,p}(x_1 \cdots x_i)$
goes to 0 (with probability 1) for any stationary and ergodic source $p$ as
$t$ goes to infinity. That is why we will focus our attention on such averages
and, by analogy with (5), we define

$$\bar\rho_{\gamma,p}(x_1 \ldots x_t) = t^{-1} \log(p(x_1 \ldots x_t)/\gamma(x_1 \ldots x_t)) \tag{7}$$

and

$$\bar\rho_t(\gamma, p) = t^{-1} \sum_{x_1 \ldots x_t \in A^t} p(x_1 \ldots x_t) \log(p(x_1 \ldots x_t)/\gamma(x_1 \ldots x_t)), \tag{8}$$

where, as before, $\gamma(x_1 \ldots x_t) = \prod_{i=1}^{t} \gamma(x_i | x_1 \ldots x_{i-1})$.

From these definitions and (6) we obtain the following estimate of the error
of the Laplace predictor $L_0$ for any i.i.d. source:

$$\bar\rho_t(L_0, p) < ((|A| - 1) \log t + c)/t, \tag{9}$$

where $c$ is a certain constant. So, we can see that the average error of the
Laplace predictor goes to zero for any i.i.d. source (which generates letters
from a known finite alphabet). As a matter of fact, the Laplace probability
$L_0(x_1 \ldots x_t)$ is a consistent estimate of the unknown probability
$p(x_1 \ldots x_t)$.

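The natural problem is to find a predictor whose error is minimal (for i.i.d.
sources). This problem was considered and solved by Krichevsky [21], see also
[22]. He suggested the following predictor:

$$K_0(a | x_1 \cdots x_t) = (\nu_{x_1 \cdots x_t}(a) + 1/2)/(t + |A|/2), \tag{10}$$

where, as before, $\nu_{x_1 \cdots x_t}(a)$ denotes the number of occurrences
of the letter $a$ in the word $x_1 \ldots x_t$. We can see that the Krichevsky
predictor is quite close to Laplace's one (1). For example, if
$A = \{0, 1\}$ and $x_1 \ldots x_5 = 01010$, then
$K_0(x_6 = 0 | 01010) = (3 + 1/2)/(5 + 1) = 7/12$,
$K_0(x_6 = 1 | 01010) = (2 + 1/2)/(5 + 1) = 5/12$ and
$K_0(01010) = \frac{1}{2} \cdot \frac{1}{4} \cdot \frac{1}{2} \cdot \frac{3}{8} \cdot \frac{1}{2} = \frac{3}{256}$.
The Krichevsky measure $K_0$ can be presented as follows:

$$K_0(x_1 \ldots x_t) = \prod_{i=1}^{t} \frac{\nu_{x_1 \ldots x_{i-1}}(x_i) + 1/2}{i - 1 + |A|/2} = \frac{\prod_{a \in A} \left( \prod_{j=1}^{\nu_{x_1 \ldots x_t}(a)} (j - 1/2) \right)}{\prod_{i=0}^{t-1} (i + |A|/2)}. \tag{11}$$

It is known that

$$(r + 1/2)((r + 1) + 1/2) \cdots (s - 1/2) = \frac{\Gamma(s + 1/2)}{\Gamma(r + 1/2)}, \tag{12}$$

where $\Gamma(\cdot)$ is the gamma function (for the definition see, e.g.,
[12]). So, (11) can be presented as follows:

$$K_0(x_1 \ldots x_t) = \frac{\prod_{a \in A} (\Gamma(\nu_{x_1 \ldots x_t}(a) + 1/2)/\Gamma(1/2))}{\Gamma(t + |A|/2)/\Gamma(|A|/2)}. \tag{13}$$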
For this predictor

$$\bar\rho_t(K_0, p) < ((|A| - 1) \log t + c)/(2t), \tag{14}$$

where $c$ is a constant, and, moreover, in a certain sense this average error
is minimal: for any predictor $\gamma$ there exists a source $p^*$ such that

$$\bar\rho_t(\gamma, p^*) \ge ((|A| - 1) \log t + c)/(2t),$$

see [21], [22].
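The Krichevsky predictor and its two equivalent forms (sequential product and gamma-function form) can be compared numerically. The sketch below is illustrative only: function names are hypothetical, and `math.lgamma` is used so that the gamma-function form stays numerically stable for long words.

```python
from fractions import Fraction
from math import lgamma, log

def krichevsky_predictor(history: str, letter: str, alphabet: str = "01") -> Fraction:
    """K_0(letter | history) = (count + 1/2) / (t + |A|/2), written with integers."""
    return Fraction(2 * history.count(letter) + 1, 2 * len(history) + len(alphabet))

def krichevsky_measure(word: str, alphabet: str = "01") -> Fraction:
    """K_0(word) as the product of sequential predictions."""
    prob = Fraction(1)
    for i, letter in enumerate(word):
        prob *= krichevsky_predictor(word[:i], letter, alphabet)
    return prob

def log2_krichevsky_gamma(word: str, alphabet: str = "01") -> float:
    """log2 K_0(word) computed from the gamma-function representation."""
    k = len(alphabet)
    ln = sum(lgamma(word.count(a) + 0.5) - lgamma(0.5) for a in alphabet)
    ln -= lgamma(len(word) + k / 2) - lgamma(k / 2)
    return ln / log(2)

print(krichevsky_predictor("01010", "0"))      # 7/12
print(krichevsky_measure("01010"))             # 3/256
print(2 ** log2_krichevsky_gamma("01010"))     # approximately 3/256
```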

2.2 Consistent estimations and on-line predictors for Markov 
and ergodic processes 

Now we briefly describe consistent estimates of unknown probabilities and
efficient on-line predictors for general stochastic processes (or sources of
information). Denote by $A^t$ and $A^*$ the set of all words of length $t$
over $A$ and the set of all finite words over $A$, respectively
($A^* = \bigcup_{i=1}^{\infty} A^i$). By $M_\infty(A)$ we denote the set of
all stationary and ergodic sources which generate letters from $A$, and let
$M_0(A) \subset M_\infty(A)$ be the set of all i.i.d. processes. Let
$M_m(A) \subset M_\infty(A)$ be the set of Markov sources of order (or with
memory, or connectivity) not larger than $m$, $m \ge 0$. Let
$M^*(A) = \bigcup_{i=0}^{\infty} M_i(A)$ be the set of all finite-order
sources.

The Laplace and Krichevsky predictors can be extended to general Markov
processes. The trick is to view a Markov source $p \in M_m(A)$ as resulting
from $|A|^m$ i.i.d. sources. We illustrate this idea by an example from ^5].
So assume that $A = \{0, 1\}$, $m = 2$ and assume that the source
$p \in M_2(A)$ has generated the sequence

00101100111010.

We represent this sequence by the following four subsequences:

**1*****1*****,

***0*1***1***0,

****1**0****1*,

******0***10**.

These four subsequences contain letters which follow 00, 01, 10 and 11,
respectively. By definition, $p \in M_m(A)$ if
$p(a | x_1 \cdots x_t) = p(a | x_{t-m+1} \cdots x_t)$ for all $t \ge m$, all
$a \in A$ and all $x_1 \cdots x_t \in A^t$. Therefore, each of the four
generated subsequences may be considered to be generated by a Bernoulli
source. Further, it is possible to reconstruct the original sequence if we
know the four ($= |A|^m$) subsequences and the first two ($= m$) letters of
the original sequence.
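The decomposition of a sequence into context-indexed subsequences can be sketched in a few lines. The function name and the dictionary representation are illustrative choices.

```python
def split_by_context(sequence: str, m: int) -> dict:
    """Map each length-m context to the string of letters that follow it."""
    subsequences = {}
    for i in range(m, len(sequence)):
        context = sequence[i - m:i]
        subsequences[context] = subsequences.get(context, "") + sequence[i]
    return subsequences

# Letters following 00, 01, 10 and 11 in the example sequence.
print(split_by_context("00101100111010", 2))
```

Each value of the returned dictionary can then be handed to any predictor for i.i.d. sources.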

Any predictor $\gamma$ for i.i.d. sources can be applied to Markov sources.
Indeed, in order to predict, it is enough to store in memory $|A|^m$
sequences, one corresponding to each word in $A^m$. Thus, in the example, the
letter $x_3$ which follows 00 is predicted based on the Bernoulli method
$\gamma$ corresponding to the $x_1 x_2$-subsequence (= 00), then $x_4$ is
predicted based on the Bernoulli method corresponding to $x_2 x_3$, i.e. to
the 01-subsequence, and so forth. When this scheme is applied along with
either $L_0$ or $K_0$ we denote the obtained predictors by $L_m$ and $K_m$,
respectively, and define the probabilities for the first $m$ letters as
follows: $L_m(x_1) = L_m(x_2) = \ldots = L_m(x_m) = 1/|A|$,
$K_m(x_1) = K_m(x_2) = \ldots = K_m(x_m) = 1/|A|$. For example, having taken
into account (13), we can present the Krichevsky predictors for $M_m(A)$ as
follows:

$$K_m(x_1 \ldots x_t) = \begin{cases} 1/|A|^t, & \text{if } t \le m, \\ \dfrac{1}{|A|^m} \prod_{v \in A^m} \dfrac{\prod_{a \in A} (\Gamma(\nu_x(va) + 1/2)/\Gamma(1/2))}{\Gamma(\bar\nu_x(v) + |A|/2)/\Gamma(|A|/2)}, & \text{if } t > m, \end{cases} \tag{15}$$

where $\bar\nu_x(v) = \sum_{a \in A} \nu_x(va)$, $x = x_1 \ldots x_t$. It is
worth noting that the representation (12) can be more convenient for carrying
out calculations. Let us consider an example. For the word 00101100111010
considered in the previous example, we obtain

$$K_2(00101100111010) = 2^{-2} \left(\tfrac{1}{2} \tfrac{3}{4}\right) \left(\tfrac{1}{2} \tfrac{1}{4} \tfrac{1}{2} \tfrac{3}{8}\right) \left(\tfrac{1}{2} \tfrac{1}{4} \tfrac{1}{2}\right) \left(\tfrac{1}{2} \tfrac{1}{4} \tfrac{1}{2}\right).$$
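A direct way to compute $K_m$ is to apply $K_0$ to the subsequence attached to each context and multiply the results. The sketch below does this with exact fractions; the names `k0` and `km` are hypothetical, and the product over contexts skips contexts that never occur (their factor is 1).

```python
from fractions import Fraction

def k0(word: str, alphabet: str = "01") -> Fraction:
    """Krichevsky measure K_0(word) as a product of sequential predictions."""
    prob = Fraction(1)
    for i, a in enumerate(word):
        prob *= Fraction(2 * word[:i].count(a) + 1, 2 * i + len(alphabet))
    return prob

def km(word: str, m: int, alphabet: str = "01") -> Fraction:
    """Krichevsky measure K_m(word) for Markov sources of memory m."""
    # The first min(m, t) letters are assigned probability 1/|A| each.
    prob = Fraction(1, len(alphabet) ** min(m, len(word)))
    contexts = {}
    for i in range(m, len(word)):
        contexts.setdefault(word[i - m:i], []).append(word[i])
    for letters in contexts.values():
        prob *= k0("".join(letters), alphabet)
    return prob

print(km("00101100111010", 2))   # 9/1048576, i.e. 9 * 2**-20
```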

Let us define the measure $R$, which, in fact, is a consistent estimator of
probabilities for the class of all stationary and ergodic processes with a
finite alphabet. First we define a probability distribution
$\{\omega = \omega_1, \omega_2, \ldots\}$ on the integers $\{1, 2, \ldots\}$
by

$$\omega_1 = 1 - 1/\log 3, \; \ldots, \; \omega_i = 1/\log(i + 1) - 1/\log(i + 2), \; \ldots. \tag{16}$$

(In what follows we will use this distribution, but the results described
below are obviously true for any distribution with nonzero probabilities.) The
measure $R$ is defined as follows:

$$R(x_1 \ldots x_t) = \sum_{i=0}^{\infty} \omega_{i+1} K_i(x_1 \ldots x_t). \tag{17}$$

It is worth noting that this construction can be applied to the Laplace
measure (if we use $L_i$ instead of $K_i$) and to any other family of
measures.
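Although $R$ is defined by an infinite sum, it can be evaluated in closed form for a finite word: for every $i \ge t$ the measure $K_i(x_1 \ldots x_t)$ equals $|A|^{-t}$, and the weights telescope, so that $\sum_{i \ge t} \omega_{i+1} = 1/\log(t + 2)$. The sketch below uses this observation; it is an illustration with hypothetical helper names, not a tuned implementation.

```python
from fractions import Fraction
from math import log2

def k0(word, alphabet="01"):
    """Krichevsky measure K_0(word)."""
    prob = Fraction(1)
    for i, a in enumerate(word):
        prob *= Fraction(2 * word[:i].count(a) + 1, 2 * i + len(alphabet))
    return prob

def km(word, m, alphabet="01"):
    """Krichevsky measure K_m(word) for memory m."""
    prob = Fraction(1, len(alphabet) ** min(m, len(word)))
    contexts = {}
    for i in range(m, len(word)):
        contexts.setdefault(word[i - m:i], []).append(word[i])
    for letters in contexts.values():
        prob *= k0("".join(letters), alphabet)
    return prob

def omega(i):
    """Weight omega_i = 1/log2(i+1) - 1/log2(i+2), i = 1, 2, ..."""
    return 1 / log2(i + 1) - 1 / log2(i + 2)

def r_measure(word, alphabet="01"):
    """R(word): finite head of the sum plus the closed-form tail."""
    t = len(word)
    head = sum(omega(i + 1) * float(km(word, i, alphabet)) for i in range(t))
    tail = len(alphabet) ** (-t) / log2(t + 2)   # K_i = |A|^-t for all i >= t
    return head + tail

print(r_measure("00"), r_measure("01"))
```

For two-letter binary words this gives approximately 0.296 for 00 and 11, and 0.204 for 01 and 10, which sum to 1.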

The main properties of the measure $R$ are connected with the Shannon entropy,
which is defined as follows:

$$H(p) = \lim_{m \to \infty} -\frac{1}{m} \sum_{v \in A^m} p(v) \log p(v). \tag{18}$$

Theorem 1. [30]. For any stationary and ergodic source $p$ the following
equalities are valid:

i) $\lim_{t \to \infty} \frac{1}{t} \log(1/R(x_1 \cdots x_t)) = H(p)$ with probability 1,

ii) $\lim_{t \to \infty} \frac{1}{t} \sum_{u \in A^t} p(u) \log(1/R(u)) = H(p)$.

2.3 Nonparametric estimations and data compression 

One of the goals of the paper is to show how practically used data compressors
can be used as a tool for nonparametric estimation, prediction and other
problems. That is why a short description of universal data compressors (or
universal codes) is given here.

A data compression method (or code) $\varphi$ is defined as a set of mappings
$\varphi_n$ such that $\varphi_n : A^n \to \{0, 1\}^*$, $n = 1, 2, \ldots$,
and for each pair of different words $x, y \in A^n$,
$\varphi_n(x) \ne \varphi_n(y)$. It is also required that each sequence
$\varphi_n(u_1) \varphi_n(u_2) \ldots \varphi_n(u_r)$, $r \ge 1$, of encoded
words from the set $A^n$, $n \ge 1$, can be uniquely decoded into
$u_1 u_2 \ldots u_r$. Such codes are called uniquely decodable. For example,
let $A = \{a, b\}$; the code $\varphi_1(a) = 0$, $\varphi_1(b) = 00$ is
obviously not uniquely decodable. It is well known that if a code $\varphi$ is
uniquely decodable then the lengths of the codewords satisfy the following
inequality (Kraft's inequality):
$\sum_{u \in A^n} 2^{-|\varphi_n(u)|} \le 1$, see, e.g., [13]. It will be
convenient to reformulate this property as follows:

Claim 1. Let $\varphi$ be a uniquely decodable code over an alphabet $A$. Then
for any integer $n$ there exists a measure $\mu_\varphi$ on $A^n$ such that

$$-\log \mu_\varphi(u) \le |\varphi(u)| \tag{19}$$

for any $u$ from $A^n$.

(Obviously, Claim 1 is true for the measure
$\mu_\varphi(u) = 2^{-|\varphi(u)|} / \sum_{v \in A^n} 2^{-|\varphi(v)|}$.) In
what follows we call uniquely decodable codes just "codes".

It is worth noting that, in fact, any measure $\mu$ defines a code for which
the length of the codeword associated with a word $u$ is (close to)
$-\log \mu(u)$.
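Kraft's inequality and the measure of Claim 1 can be checked on a toy code. In the sketch below the code table is a made-up example (a fixed-length code on a two-letter alphabet, deliberately wasteful so that the Kraft sum is below 1), and the helper names are hypothetical.

```python
from math import log2

def kraft_sum(codewords: dict) -> float:
    """Sum of 2^(-|phi(u)|) over all encoded words u."""
    return sum(2.0 ** -len(c) for c in codewords.values())

def measure_from_code(codewords: dict) -> dict:
    """Normalized measure mu(u) = 2^(-|phi(u)|) / Kraft sum."""
    z = kraft_sum(codewords)
    return {u: 2.0 ** -len(c) / z for u, c in codewords.items()}

code = {"a": "00", "b": "01"}          # hypothetical uniquely decodable code
mu = measure_from_code(code)

print(kraft_sum(code))                 # 0.5, so Kraft's inequality holds
for u, c in code.items():
    # -log2 mu(u) never exceeds the codeword length.
    print(u, -log2(mu[u]), len(c))
```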

Now we consider universal codes. By definition, a code $U$ is universal if for
any stationary and ergodic source $p$ the following equalities are valid:

$$\lim_{t \to \infty} |U(x_1 \ldots x_t)|/t = H(p) \tag{20}$$

with probability 1, and

$$\lim_{t \to \infty} E(|U(x_1 \ldots x_t)|)/t = H(p), \tag{21}$$

where $H(p)$ is the Shannon entropy of $p$ and $E(f)$ is the mean value of
$f$. In fact, (21) and (20) are valid for known universal codes, but there
exist codes for which only one equality is valid.

3 Finite-alphabet processes 

3.1 The estimation of (limiting) probabilities 

The following theorem shows how universal codes can be applied to probability
estimation.

Theorem 2. Let $U$ be a universal code and

$$\mu_U(u) = 2^{-|U(u)|} / \sum_{v \in A^{|u|}} 2^{-|U(v)|}. \tag{22}$$

Then, for any stationary and ergodic source $p$ the following equalities are
valid:

i) $\lim_{t \to \infty} \frac{1}{t} \left( -\log p(x_1 \cdots x_t) - (-\log \mu_U(x_1 \cdots x_t)) \right) = 0$ with probability 1,

ii) $\lim_{t \to \infty} \frac{1}{t} \sum_{u \in A^t} p(u) \log(p(u)/\mu_U(u)) = 0$,

iii) $\lim_{t \to \infty} \frac{1}{t} \sum_{u \in A^t} p(u)\, |p(u) - \mu_U(u)| = 0$.

Proof. The proof is based on the Shannon-McMillan-Breiman theorem, which
states that for any stationary and ergodic source $p$

$$\lim_{t \to \infty} -\frac{1}{t} \log p(x_1 \ldots x_t) = H(p)$$

with probability 1, see [3], [13]. From this equality and (20) we obtain
statement i). The second statement follows from the definition of the Shannon
entropy (18) and (21), whereas iii) follows from ii) and Pinsker's inequality
(4).

So, we can see that, in a certain sense, the measure $\mu_U$ is a consistent
(nonparametric) estimate of the (unknown) measure $p$.

Nowadays there are many efficient universal codes (and universal predictors con- 
nected with them), see [151 [TTl [26l [271 EQl [37] , which can be applied to estimation. For 
example, the above described measure R is based on the code from [291 ISD] and can 
be applied for probability estimation. More precisely. Theorem 2 (and the following 
theorems) are true for R, if we replace fiu by R. 

It is important to note that the measure $R$ has some additional properties,
which can be useful for applications. The following theorem describes these
properties (whereas all other theorems are valid for all universal codes and
the measures corresponding to them, including the measure $R$).

Theorem 3. For any Markov process p with memory k 

i) the error of the probability estimator, which is based on the measure R, is upper- 
bounded as follows: 



ii) in a certain sense the error of R is asymptotically minimal: for any measure 
/i there exists a k— memory Markov process such that 



iii) Let $\Theta$ be a set of stationary and ergodic processes such that there
exists a measure $\mu_\Theta$ for which the estimation error of the
probability goes to 0 uniformly over $p \in \Theta$. Then the error of the
estimator based on the measure $R$ goes to 0 uniformly, too.








Proof can be found in (SUl [21] ■ 
3.2 Prediction

As we mentioned above, any universal code $U$ can be applied to prediction.
Namely, the measure $\mu_U$ (22) can be used for prediction as the following
conditional probability:

$$\mu_U(x_{t+1} | x_1 \ldots x_t) = \mu_U(x_1 \ldots x_t x_{t+1}) / \mu_U(x_1 \ldots x_t). \tag{23}$$
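A rough sketch of compressor-based prediction is given below. It departs from the definition in one labeled respect: instead of normalizing $2^{-|U(\cdot)|}$ over all words of a given length, which is infeasible for long words, it normalizes the weights $2^{-(|U(xa)| - |U(x)|)}$ over the candidate next letters $a$ only. `zlib` again stands in for the universal code, and because its output length changes in whole bytes, the resulting probabilities are coarse. All names are hypothetical.

```python
import zlib

def code_length_bits(word: str) -> int:
    """Bits used by zlib for `word` (a stand-in for |U(word)|)."""
    return 8 * len(zlib.compress(word.encode("ascii"), 9))

def predict_next(history: str, alphabet: str = "01") -> dict:
    """Heuristic next-letter distribution from code-length increments."""
    base = code_length_bits(history)
    weights = {a: 2.0 ** -(code_length_bits(history + a) - base) for a in alphabet}
    total = sum(weights.values())      # normalization over next letters only
    return {a: w / total for a, w in weights.items()}

print(predict_next("01" * 200))
```

A compressor with finer length granularity (for instance an arithmetic coder driven by an adaptive model) would give smoother estimates; the structure of the computation stays the same.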

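Theorem 4. Let $U$ be a universal code and $p$ be any stationary and ergodic
process. Then

i) $\lim_{t \to \infty} \frac{1}{t} \left\{ E\left(\log \frac{p(x_1)}{\mu_U(x_1)}\right) + E\left(\log \frac{p(x_2 | x_1)}{\mu_U(x_2 | x_1)}\right) + \ldots + E\left(\log \frac{p(x_t | x_1 \ldots x_{t-1})}{\mu_U(x_t | x_1 \ldots x_{t-1})}\right) \right\} = 0$,

ii) $\lim_{t \to \infty} E\left( \frac{1}{t} \sum_{i=0}^{t-1} (p(x_{i+1} | x_1 \ldots x_i) - \mu_U(x_{i+1} | x_1 \ldots x_i))^2 \right) = 0$,

iii) $\lim_{t \to \infty} E\left( \frac{1}{t} \sum_{i=0}^{t-1} |p(x_{i+1} | x_1 \ldots x_i) - \mu_U(x_{i+1} | x_1 \ldots x_i)| \right) = 0$.

Proof. i) immediately follows from the second statement of the previous
theorem and properties of log. Statement ii) can be proved as follows:

$$\lim_{t \to \infty} E\left( \frac{1}{t} \sum_{i=0}^{t-1} (p(x_{i+1} | x_1 \ldots x_i) - \mu_U(x_{i+1} | x_1 \ldots x_i))^2 \right) \le$$

$$\lim_{t \to \infty} E\left( \frac{1}{t} \sum_{i=0}^{t-1} \Big( \sum_{a \in A} |p(a | x_1 \ldots x_i) - \mu_U(a | x_1 \ldots x_i)| \Big)^2 \right) \le$$

$$\lim_{t \to \infty} \frac{\mathrm{const}}{t} E\left( \sum_{i=0}^{t-1} \sum_{a \in A} p(a | x_1 \ldots x_i) \log(p(a | x_1 \ldots x_i)/\mu_U(a | x_1 \ldots x_i)) \right) =$$

$$\lim_{t \to \infty} \left( \frac{\mathrm{const}}{t} \sum_{i=0}^{t-1} \sum_{x_1 \ldots x_i \in A^i} p(x_1 \ldots x_i) \sum_{a \in A} p(a | x_1 \ldots x_i) \log(p(a | x_1 \ldots x_i)/\mu_U(a | x_1 \ldots x_i)) \right) =$$

$$\lim_{t \to \infty} \left( \frac{\mathrm{const}}{t} \sum_{x_1 \ldots x_t \in A^t} p(x_1 \ldots x_t) \log(p(x_1 \ldots x_t)/\mu_U(x_1 \ldots x_t)) \right).$$

Here the first inequality is obvious, the second follows from Pinsker's
inequality (4), and the others from properties of expectation and log; the
last limit equals 0 by statement ii) of Theorem 2. iii) can be derived from
ii) and the Jensen inequality for the function $x^2$. The theorem is proven.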

Comment 1. The measure $R$ described above has one additional property if it
is used for prediction. Namely, for any Markov process $p$
($p \in M^*(A)$) the following is true:

$$\lim_{t \to \infty} \log \frac{p(x_{t+1} | x_1 \ldots x_t)}{R(x_{t+1} | x_1 \ldots x_t)} = 0$$

with probability 1, where
$R(x_{t+1} | x_1 \ldots x_t) = R(x_1 \ldots x_t x_{t+1}) / R(x_1 \ldots x_t)$;
see [31].

Comment 2. In fact, the statements ii) and iii) are equivalent, because one of 
them follows from the other. For details see Lemma 2 in 
3.3 Problems with side information

Now we consider so-called problems with side information, which are described
as follows: there is a stationary and ergodic source whose alphabet $A$ is
presented as a product $A = X \times Y$. We are given a sequence
$(x_1, y_1), \ldots, (x_{t-1}, y_{t-1})$ and so-called side information $y_t$.
The goal is to predict, or estimate, $x_t$. This problem arises in statistical
decision theory, pattern recognition, and machine learning, see [25].
Obviously, anyone who knows the conditional probabilities
$p(x_t | (x_1, y_1), \ldots, (x_{t-1}, y_{t-1}), y_t)$ for all $x_t \in X$ has
all information about $x_t$ that is available before $x_t$ is known. That is
why we will look for the best (or, at least, good) estimates of these
conditional probabilities. Our solution will be based on the results obtained
in the two previous subsections. More precisely, for any universal code $U$
and the corresponding measure $\mu_U$ (22) we define the following estimate
for the problem with side information:

$$\mu_U(x_t | (x_1, y_1), \ldots, (x_{t-1}, y_{t-1}), y_t) = \frac{\mu_U((x_1, y_1), \ldots, (x_{t-1}, y_{t-1}), (x_t, y_t))}{\sum_{x_t \in X} \mu_U((x_1, y_1), \ldots, (x_{t-1}, y_{t-1}), (x_t, y_t))}.$$

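Theorem 5. Let $U$ be a universal code and $p$ be any stationary and ergodic
process. Then

i) $\lim_{t \to \infty} \frac{1}{t} \left\{ E\left(\log \frac{p(x_1 | y_1)}{\mu_U(x_1 | y_1)}\right) + E\left(\log \frac{p(x_2 | (x_1, y_1), y_2)}{\mu_U(x_2 | (x_1, y_1), y_2)}\right) + \ldots + E\left(\log \frac{p(x_t | (x_1, y_1), \ldots, (x_{t-1}, y_{t-1}), y_t)}{\mu_U(x_t | (x_1, y_1), \ldots, (x_{t-1}, y_{t-1}), y_t)}\right) \right\} = 0$,

ii) $\lim_{t \to \infty} E\left( \frac{1}{t} \sum_{i=0}^{t-1} \left( p(x_{i+1} | (x_1, y_1), \ldots, (x_i, y_i), y_{i+1}) - \mu_U(x_{i+1} | (x_1, y_1), \ldots, (x_i, y_i), y_{i+1}) \right)^2 \right) = 0$,

and

iii) $\lim_{t \to \infty} E\left( \frac{1}{t} \sum_{i=0}^{t-1} \left| p(x_{i+1} | (x_1, y_1), \ldots, (x_i, y_i), y_{i+1}) - \mu_U(x_{i+1} | (x_1, y_1), \ldots, (x_i, y_i), y_{i+1}) \right| \right) = 0$.

Proof. The following inequality follows from the nonnegativity of the KL
divergence (see (3)), whereas the equality is obvious:

$$E\left(\log \frac{p(x_1 | y_1)}{\mu_U(x_1 | y_1)}\right) + E\left(\log \frac{p(x_2 | (x_1, y_1), y_2)}{\mu_U(x_2 | (x_1, y_1), y_2)}\right) + \ldots \le$$

$$E\left(\log \frac{p(y_1)}{\mu_U(y_1)}\right) + E\left(\log \frac{p(x_1 | y_1)}{\mu_U(x_1 | y_1)}\right) + E\left(\log \frac{p(y_2 | (x_1, y_1))}{\mu_U(y_2 | (x_1, y_1))}\right) + E\left(\log \frac{p(x_2 | (x_1, y_1), y_2)}{\mu_U(x_2 | (x_1, y_1), y_2)}\right) + \ldots$$

$$= E\left(\log \frac{p(x_1, y_1)}{\mu_U(x_1, y_1)}\right) + E\left(\log \frac{p((x_2, y_2) | (x_1, y_1))}{\mu_U((x_2, y_2) | (x_1, y_1))}\right) + \ldots.$$

Now we can apply the first statement of Theorem 4 to the last sum as follows:

$$\lim_{t \to \infty} \frac{1}{t} \left\{ E\left(\log \frac{p(x_1, y_1)}{\mu_U(x_1, y_1)}\right) + E\left(\log \frac{p((x_2, y_2) | (x_1, y_1))}{\mu_U((x_2, y_2) | (x_1, y_1))}\right) + \ldots + E\left(\log \frac{p((x_t, y_t) | (x_1, y_1) \ldots (x_{t-1}, y_{t-1}))}{\mu_U((x_t, y_t) | (x_1, y_1) \ldots (x_{t-1}, y_{t-1}))}\right) \right\} = 0.$$

From this equality and the last inequality we obtain the proof of i). The
proof of the second statement can be obtained from the similar representation
for ii) and the second statement of Theorem 4. iii) can be derived from ii)
and the Jensen inequality for the function $x^2$. The theorem is proven.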

3.4 The case of several independent samples 

Now we extend our consideration to the case where the sample is presented as
several non-overlapping sequences $x^1 = x^1_1 \ldots x^1_{t_1}$,
$x^2 = x^2_1 \ldots x^2_{t_2}$, $\ldots$, $x^r = x^r_1 \ldots x^r_{t_r}$
generated by a source. More precisely, we will suppose that all sequences were
created by one stationary and ergodic source. (As mentioned above, it is
impossible just to combine all samples into one if the source is not i.i.d.)
We denote this sample by $x^1 \diamond x^2 \diamond \ldots \diamond x^r$ and
define
$\nu_{x^1 \diamond x^2 \diamond \ldots \diamond x^r}(v) = \sum_{i=1}^{r} \nu_{x^i}(v)$.
For example, if $x^1 = 0010$, $x^2 = 011$, then
$\nu_{x^1 \diamond x^2}(00) = 1$. The definitions of $K_m$ and $R$ can be
extended to this case:

$$K_m(x^1 \diamond x^2 \diamond \ldots \diamond x^r) = \left( \prod_{i=1}^{r} |A|^{-\min\{m, t_i\}} \right) \prod_{v \in A^m} \frac{\prod_{a \in A} (\Gamma(\nu_{x^1 \diamond \ldots \diamond x^r}(va) + 1/2)/\Gamma(1/2))}{\Gamma(\bar\nu_{x^1 \diamond \ldots \diamond x^r}(v) + |A|/2)/\Gamma(|A|/2)}, \tag{24}$$

whereas the definition of $R$ is the same (see (17)). (Here, as before,
$\bar\nu_{x^1 \diamond \ldots \diamond x^r}(v) = \sum_{a \in A} \nu_{x^1 \diamond \ldots \diamond x^r}(va)$.
Note that $\bar\nu_{x^1 \diamond \ldots \diamond x^r}(\,) = \sum_{i=1}^{r} t_i$
if $m = 0$.)

The following example is intended to show the difference between the case of
many samples and the case of one. Let there be two independent samples
$y = y_1 \ldots y_4 = 0101$ and $x = x_1 \ldots x_3 = 101$, generated by a
stationary and ergodic source with the alphabet $\{0, 1\}$. One wants to
estimate the (limiting) probabilities $P(z_1 z_2)$, $z_1, z_2 \in \{0, 1\}$
(here $z_1 z_2 \ldots$ can be considered as an independent sequence generated
by the source) and to predict $x_4 x_5$ (i.e. estimate the conditional
probability $P(x_4 x_5 | x_1 \ldots x_3 = 101, y_1 \ldots y_4 = 0101)$). For
solving both problems we will use the measure $R$ (see (17)). First we
consider the case where $P(z_1 z_2)$ is to be estimated without knowledge of
the sequences $x$ and $y$. From (11) and (15) we obtain:

$$K_0(00) = K_0(11) = \tfrac{1}{2} \cdot \tfrac{3}{4} = 3/8, \qquad K_0(01) = K_0(10) = \tfrac{1}{2} \cdot \tfrac{1}{4} = 1/8,$$

$$K_i(00) = K_i(01) = K_i(10) = K_i(11) = 1/4, \quad i \ge 1.$$

Having taken into account the definitions of $\omega_i$ (16) and the measure
$R$ (17), we can calculate $R(z_1 z_2)$ as follows:

$$R(00) = \omega_1 K_0(00) + \omega_2 K_1(00) + \ldots = (1 - 1/\log 3) \cdot 3/8 + (1/\log 3 - 1/\log 4) \cdot 1/4 + (1/\log 4 - 1/\log 5) \cdot 1/4 + \ldots = (1 - 1/\log 3) \cdot 3/8 + (1/\log 3) \cdot 1/4 \approx 0.296.$$

Analogously, $R(01) = R(10) \approx 0.204$, $R(11) \approx 0.296$.

Let us now estimate the probability $P(z_1 z_2)$ taking into account that
there are two independent samples $y = y_1 \ldots y_4 = 0101$ and
$x = x_1 \ldots x_3 = 101$. First of all we note that such estimates are based
on the formula for conditional probabilities:

$$R(z | x \diamond y) = R(x \diamond y \diamond z)/R(x \diamond y).$$

First we compute the frequencies:

$\nu_{0101 \diamond 101}(0) = 3$, $\nu_{0101 \diamond 101}(1) = 4$,
$\nu_{0101 \diamond 101}(00) = \nu_{0101 \diamond 101}(11) = 0$,
$\nu_{0101 \diamond 101}(01) = 3$, $\nu_{0101 \diamond 101}(10) = 2$,
$\nu_{0101 \diamond 101}(010) = 1$, $\nu_{0101 \diamond 101}(101) = 2$,
$\nu_{0101 \diamond 101}(0101) = 1$,

whereas the frequencies of all other three-letter and four-letter words are 0.
Then we calculate:

$$K_0(0101 \diamond 101) = \frac{1 \cdot 3 \cdot 5 \cdot 7}{2 \cdot 4 \cdot 6 \cdot 8} \cdot \frac{1 \cdot 3 \cdot 5}{10 \cdot 12 \cdot 14} \approx 0.00244, \qquad K_1(0101 \diamond 101) = 2^{-2}\, \frac{1 \cdot 3 \cdot 5}{2 \cdot 4 \cdot 6} \cdot \frac{1 \cdot 3}{2 \cdot 4} \approx 0.0293,$$

$$K_2(0101 \diamond 101) \approx 0.01172, \qquad K_i(0101 \diamond 101) = 2^{-7}, \; i \ge 3,$$

$$R(0101 \diamond 101) = \omega_1 K_0(0101 \diamond 101) + \omega_2 K_1(0101 \diamond 101) + \ldots \approx 0.369 \cdot 0.00244 + 0.131 \cdot 0.0293 + 0.06932 \cdot 0.01172 + 2^{-7}/\log 5 \approx 0.0089.$$

In order to avoid repetition, we estimate only one probability
$P(z_1 z_2 = 01)$. Carrying out similar calculations, we obtain

$$R(0101 \diamond 101 \diamond 01) \approx 0.00292,$$

$$R(z_1 z_2 = 01 | y_1 \ldots y_4 = 0101, x_1 \ldots x_3 = 101) = R(0101 \diamond 101 \diamond 01)/R(0101 \diamond 101) \approx 0.328.$$

If we compare this value with the estimate $R(01) \approx 0.204$, which is not
based on the knowledge of the samples $x$ and $y$, we can see that the measure
$R$ uses additional information quite naturally (indeed, 01 is quite frequent
in $y = y_1 \ldots y_4 = 0101$ and $x = x_1 \ldots x_3 = 101$).
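The calculations of this example can be reproduced with a short script. The sketch below extends $K_m$ to several samples by pooling the context counts and charging $|A|^{-\min\{m, t_i\}}$ for the start of each sample; the infinite sum defining $R$ is closed off using the fact that $K_i$ is constant once $i$ is at least the length of the longest sample. Function names are hypothetical.

```python
from fractions import Fraction
from math import log2

def k0(word, alphabet="01"):
    """Krichevsky measure K_0(word); it depends only on the letter counts."""
    prob = Fraction(1)
    for i, a in enumerate(word):
        prob *= Fraction(2 * word[:i].count(a) + 1, 2 * i + len(alphabet))
    return prob

def km_samples(samples, m, alphabet="01"):
    """K_m for several non-overlapping samples with pooled context counts."""
    prob = Fraction(1)
    contexts = {}
    for x in samples:
        prob /= len(alphabet) ** min(m, len(x))     # start of each sample
        for i in range(m, len(x)):
            contexts.setdefault(x[i - m:i], []).append(x[i])
    for letters in contexts.values():
        prob *= k0("".join(letters), alphabet)
    return prob

def r_samples(samples, alphabet="01"):
    """R for several samples: finite head plus closed-form tail of the sum."""
    longest = max(len(x) for x in samples)
    head = sum((1 / log2(i + 2) - 1 / log2(i + 3)) * float(km_samples(samples, i, alphabet))
               for i in range(longest))
    tail = float(km_samples(samples, longest, alphabet)) / log2(longest + 2)
    return head + tail

joint = r_samples(["0101", "101", "01"])
given = r_samples(["0101", "101"])
print(given, joint, joint / given)
```

Evaluated without intermediate rounding, the ratio comes out at about 0.3275.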

Such a generalization can be applied to many universal codes, but, generally
speaking, there exist codes $U$ for which $U(x^1 \diamond x^2)$ is not defined
for independent samples $x^1$ and $x^2$ and, hence, the measure
$\mu_U(x^1 \diamond x^2)$ is not defined. That is why we will not describe the
properties of arbitrary universal codes, but those of $R$ only. For the
measure $R$ all asymptotic properties are the same for the case of one sample
and for the case of several. More precisely, the following statement is true:

Claim 2. Let $x^1 \diamond x^2 \diamond \ldots \diamond x^r$ be
non-overlapping samples generated by a stationary and ergodic source and let
$t$ be the total length of those samples ($t = \sum_{i=1}^{r} |x^i|$). Then,
if $t \to \infty$ (and $r$ is fixed), the statements of Theorems 1-5 are valid
when applied to $x^1 \diamond x^2 \diamond \ldots \diamond x^r$ instead of the
one sample $x_1 \ldots x_t$. (In Theorems 2, 4, 5 $\mu_U$ should be replaced
by $R$.)

The proofs are analogous to the proofs of Theorems 1-5.
4 Real-valued time series

Let $X_t$ be a time series with each $X_t$ taking values in some interval
$\Lambda$. The probability distribution of $X_t$ is unknown but it is known
that the time series is stationary and ergodic. Let $\{\Pi_n\}$, $n \ge 1$, be
an increasing sequence of finite partitions that asymptotically generates the
Borel sigma-field on $\Lambda$, and let $x^{[k]}$ denote the element of
$\Pi_k$ that contains the point $x$. (Informally, $x^{[k]}$ is obtained by
quantizing $x$ to $k$ bits of precision.) Suppose that the joint distribution
$P_n$ for $(X_1, \ldots, X_n)$ has a probability density function
$p(x_1, \ldots, x_n)$ with respect to a sigma-finite measure $\lambda_n$.
(For example, $\lambda_n$ can be Lebesgue measure, counting measure, etc.) For
integers $s$ and $n$ we define the following approximation of the density:

$$p^s(x_1, \ldots, x_n) = P(x_1^{[s]}, \ldots, x_n^{[s]})/\lambda_n(x_1^{[s]}, \ldots, x_n^{[s]}). \tag{25}$$

Let $p(x_{n+1} | x_1, \ldots, x_n)$ denote the conditional density given by
the ratio $p(x_1, \ldots, x_{n+1})/p(x_1, \ldots, x_n)$ for $n \ge 1$. It is
known that for stationary and ergodic processes there exists a so-called
relative entropy rate $h$ defined by

$$h = \lim_{n \to \infty} E(\log p(x_{n+1} | x_1, \ldots, x_n)), \tag{26}$$

where $E$ denotes expectation with respect to $P$; see [2]. We also consider

$$h_s = \lim_{n \to \infty} E(\log p^s(x_{n+1} | x_1, \ldots, x_n)). \tag{27}$$

It is shown by Barron [2] that almost surely

$$\lim_{t \to \infty} \frac{1}{t} \log p(x_1 \ldots x_t) = h. \tag{28}$$

Applying the same theorem to the density $p^s(x_1, \ldots, x_t)$, we obtain
that a.s.

$$\lim_{t \to \infty} \frac{1}{t} \log p^s(x_1, \ldots, x_t) = h_s. \tag{29}$$

Let $U$ be a universal code, which is defined for any finite alphabet. We define the
corresponding density $r_U$ as follows:

\[ r_U(x_1 \ldots x_t) = \sum_{i=0}^{\infty} \omega_i \, 2^{-|U(x_1^{[i]} \ldots x_t^{[i]})|} / \lambda_t(x_1^{[i]} \ldots x_t^{[i]}). \tag{30} \]

(It is supposed here that the code $U(x_1^{[i]} \ldots x_t^{[i]})$ is defined for the alphabet which
contains $|\Pi_i|$ letters.)
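A minimal sketch of how a sum of the form (30) could be evaluated numerically is given below. It rests on several assumptions that are not part of the text: `zlib` stands in for the universal code $U$ (it is a practical compressor, not a proven universal code), the partitions are uniform with $2^i$ cells on a known interval, the measure is Lebesgue, the sum is truncated at `max_level`, and the weights `weight(i)` are one hypothetical choice of positive numbers summing to one.

```python
import math
import zlib

def code_length_bits(symbols):
    """|U(.)| in bits, with zlib standing in for the universal code U (an assumption)."""
    return 8 * len(zlib.compress(bytes(symbols), 9))  # each symbol must fit in one byte

def weight(i):
    """Hypothetical weights w_i > 0; the series telescopes to 1 over i = 0, 1, 2, ..."""
    return 1.0 / math.log2(i + 2) - 1.0 / math.log2(i + 3)

def log2_r_U(xs, max_level=8, lo=0.0, hi=1.0):
    """log2 of the sum (30), truncated at max_level, for a sample xs from [lo, hi)."""
    t = len(xs)
    terms = []
    for i in range(max_level + 1):
        cells = 2 ** i
        width = (hi - lo) / cells
        q = [min(max(int((x - lo) / (hi - lo) * cells), 0), cells - 1) for x in xs]
        # log2( w_i * 2^{-|U|} / lambda_t(cell) ), where lambda_t(cell) = width ** t
        terms.append(math.log2(weight(i)) - code_length_bits(q) - t * math.log2(width))
    m = max(terms)  # log-sum-exp in base 2 to avoid floating-point underflow
    return m + math.log2(sum(2.0 ** (v - m) for v in terms))
```

Working with logarithms is a design choice of this sketch: the individual terms $2^{-|U(\cdot)|}$ underflow for samples of even moderate length. A highly regular sample (for example a constant one) compresses well at every level and therefore receives a much larger value of `log2_r_U` than an irregular sample of the same length.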

It turns out that, in a certain sense, the density $r_U(x_1 \ldots x_t)$ estimates the
unknown density $p(x_1, \ldots, x_t)$.

Theorem 6. Let $X_t$ be a stationary ergodic process with densities $p(x_1, \ldots, x_t)
= dP_t/d\lambda_t$ such that $h < \infty$, where $h$ is the relative entropy rate, see (26). Then the
following equality is true with probability 1:

\[ \lim_{t \to \infty} \frac{1}{t} \Bigl\{ \log \frac{p(x_1)}{r_U(x_1)} + \ldots + \log \frac{p(x_{n+1} | x_1 \ldots x_n)}{r_U(x_{n+1} | x_1 \ldots x_n)} + \ldots + \log \frac{p(x_t | x_1 \ldots x_{t-1})}{r_U(x_t | x_1 \ldots x_{t-1})} \Bigr\} = 0. \tag{31} \]

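Proof. First we note that the following equality can be easily derived from the
definitions and martingale properties:

\[ \lim_{s \to \infty} h_s = h. \tag{32} \]

It can be seen that (31) is equivalent to the following equality:

\[ \lim_{t \to \infty} \frac{1}{t} \log \frac{p(x_1 \ldots x_t)}{r_U(x_1 \ldots x_t)} = 0. \tag{33} \]

Next we note that for any integer $s$ the following obvious equality is true:
$r_U(x_1 \ldots x_t) = \omega_s 2^{-|U(x_1^{[s]} \ldots x_t^{[s]})|} / \lambda_t(x_1^{[s]} \ldots x_t^{[s]}) \, (1 + \delta)$ for some $\delta > 0$.
From this equality and (30) we immediately obtain that a.s. (the factor $\omega_s$ does not
depend on $t$, so $\frac{1}{t} \log \omega_s \to 0$)

\[ \lim_{t \to \infty} \frac{1}{t} \log \frac{p(x_1 \ldots x_t)}{r_U(x_1 \ldots x_t)} \le \lim_{t \to \infty} \frac{1}{t} \log \frac{p(x_1 \ldots x_t)}{2^{-|U(x_1^{[s]} \ldots x_t^{[s]})|} / \lambda_t(x_1^{[s]} \ldots x_t^{[s]})}. \tag{34} \]

The right part can be presented as follows:

\[ \begin{aligned} \lim_{t \to \infty} \frac{1}{t} \log \frac{p(x_1 \ldots x_t)}{2^{-|U(x_1^{[s]} \ldots x_t^{[s]})|} / \lambda_t(x_1^{[s]} \ldots x_t^{[s]})} = {} & \lim_{t \to \infty} \frac{1}{t} \log \frac{p^s(x_1, \ldots, x_t) \, \lambda_t(x_1^{[s]} \ldots x_t^{[s]})}{2^{-|U(x_1^{[s]} \ldots x_t^{[s]})|}} \\ & + \lim_{t \to \infty} \frac{1}{t} \log \frac{p(x_1 \ldots x_t)}{p^s(x_1, \ldots, x_t)}. \end{aligned} \tag{35} \]

Taking into account that $U$ is a universal code and (25), we can see that the
first term equals zero. From (28) and (29) we can see that a.s. the second term
is equal to $h - h_s$. This equality is valid for any integer $s$ and, according to (32),
$\lim_{s \to \infty} h_s = h$. Hence, letting $s \to \infty$, the second term vanishes too, and we obtain
the proof of (33). The theorem is proven.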
Corollary 1.

\[ \lim_{t \to \infty} \frac{1}{t} E\Bigl( \log \frac{p(x_1 \ldots x_t)}{r_U(x_1 \ldots x_t)} \Bigr) = 0. \]

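Proof. Analogously to (34) and (35) we can obtain the following inequality (up to an
additive term $\log(1/\omega_s)$, which does not depend on $t$ and vanishes after division by $t$)

\[ \log \frac{p(x_1 \ldots x_t)}{r_U(x_1 \ldots x_t)} \le \log \frac{p(x_1 \ldots x_t)}{2^{-|U(x_1^{[s]} \ldots x_t^{[s]})|} / \lambda_t(x_1^{[s]} \ldots x_t^{[s]})} = \log \frac{p^s(x_1, \ldots, x_t) \, \lambda_t(x_1^{[s]} \ldots x_t^{[s]})}{2^{-|U(x_1^{[s]} \ldots x_t^{[s]})|}} + \log \frac{p(x_1 \ldots x_t)}{p^s(x_1, \ldots, x_t)} \tag{36} \]

for any integer $s$. Hence,

\[ \frac{1}{t} E\Bigl( \log \frac{p(x_1 \ldots x_t)}{r_U(x_1 \ldots x_t)} \Bigr) \le E\Bigl( \frac{1}{t} \log \frac{p^s(x_1, \ldots, x_t) \, \lambda_t(x_1^{[s]} \ldots x_t^{[s]})}{2^{-|U(x_1^{[s]} \ldots x_t^{[s]})|}} \Bigr) + E\Bigl( \frac{1}{t} \log \frac{p(x_1 \ldots x_t)}{p^s(x_1, \ldots, x_t)} \Bigr). \tag{37} \]

The first term is the average redundancy of a universal code for a finite-alphabet
source; hence it tends to 0 according to the definition of a universal code. The
second term tends to $h - h_s$ for any $s$; since $h_s \to h$ by (32), it is equal to zero.
Corollary 1 is proven.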

Corollary 2.

\[ \text{i)} \quad \lim_{t \to \infty} \frac{1}{t} \int ( p(x_1 \ldots x_t) - r_U(x_1 \ldots x_t) )^2 \, d\lambda_t = 0, \]
\[ \text{ii)} \quad \lim_{t \to \infty} \frac{1}{t} \int | p(x_1 \ldots x_t) - r_U(x_1 \ldots x_t) | \, d\lambda_t = 0. \]

Proof. i) immediately follows from Corollary 1 and Pinsker's inequality
([1]). ii) can be derived from i) and the Jensen inequality for the function $x^2$.
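For reference, Pinsker's inequality in the form typically used in arguments of this kind can be written as follows. The block is a reminder of the standard statement, not a quotation from the paper: $p$ and $q$ are arbitrary densities with respect to a common measure $\lambda$, and the constant $2 \ln 2$ corresponds to logarithms taken to base 2.

```latex
% Pinsker's inequality with the KL divergence measured in bits (base-2 logarithm)
\[
  \Bigl( \int |p - q| \, d\lambda \Bigr)^{2}
  \;\le\; 2 \ln 2 \int p \, \log_2 \frac{p}{q} \, d\lambda .
\]
% With p = p(x_1 ... x_t) and q = r_U(x_1 ... x_t), the right-hand side is
% 2 ln 2 times E( log p / r_U ), the quantity controlled by Corollary 1.
```

The right-hand side grows at most like $o(t)$ by Corollary 1, which is what drives the normalised limits above.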

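Theorem 7. Let $B_1, B_2, \ldots$ be a sequence of measurable sets. Then the following
equalities are true:

\[ \text{i)} \quad \lim_{t \to \infty} E\Bigl( \frac{1}{t} \sum_{m=0}^{t-1} ( P(x_{m+1} \in B_{m+1} | x_1 \ldots x_m) - R_U(x_{m+1} \in B_{m+1} | x_1 \ldots x_m) )^2 \Bigr) = 0, \tag{38} \]

\[ \text{ii)} \quad \lim_{t \to \infty} E\Bigl( \frac{1}{t} \sum_{m=0}^{t-1} | P(x_{m+1} \in B_{m+1} | x_1 \ldots x_m) - R_U(x_{m+1} \in B_{m+1} | x_1 \ldots x_m) | \Bigr) = 0. \]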

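Proof. Obviously,

\[ \begin{aligned} & E\Bigl( \frac{1}{t} \sum_{m=0}^{t-1} ( P(x_{m+1} \in B_{m+1} | x_1 \ldots x_m) - R_U(x_{m+1} \in B_{m+1} | x_1 \ldots x_m) )^2 \Bigr) \le \\ & \frac{1}{t} \sum_{m=0}^{t-1} E\bigl( | P(x_{m+1} \in B_{m+1} | x_1 \ldots x_m) - R_U(x_{m+1} \in B_{m+1} | x_1 \ldots x_m) | \\ & \qquad + | P(x_{m+1} \in \bar{B}_{m+1} | x_1 \ldots x_m) - R_U(x_{m+1} \in \bar{B}_{m+1} | x_1 \ldots x_m) | \bigr)^2, \end{aligned} \tag{39} \]

where $\bar{B}_{m+1}$ denotes the complement of $B_{m+1}$. From the Pinsker inequality ([1]) and
convexity of the KL divergence ([3]) we obtain the following inequalities:

\[ \begin{aligned} & \frac{1}{t} \sum_{m=0}^{t-1} E\bigl( | P(x_{m+1} \in B_{m+1} | x_1 \ldots x_m) - R_U(x_{m+1} \in B_{m+1} | x_1 \ldots x_m) | \\ & \qquad + | P(x_{m+1} \in \bar{B}_{m+1} | x_1 \ldots x_m) - R_U(x_{m+1} \in \bar{B}_{m+1} | x_1 \ldots x_m) | \bigr)^2 \le \\ & \frac{\mathrm{const}}{t} \sum_{m=0}^{t-1} E\Bigl( P(x_{m+1} \in B_{m+1} | x_1 \ldots x_m) \log \frac{P(x_{m+1} \in B_{m+1} | x_1 \ldots x_m)}{R_U(x_{m+1} \in B_{m+1} | x_1 \ldots x_m)} \\ & \qquad + P(x_{m+1} \in \bar{B}_{m+1} | x_1 \ldots x_m) \log \frac{P(x_{m+1} \in \bar{B}_{m+1} | x_1 \ldots x_m)}{R_U(x_{m+1} \in \bar{B}_{m+1} | x_1 \ldots x_m)} \Bigr) \le \\ & \frac{\mathrm{const}}{t} \sum_{m=0}^{t-1} \Bigl( \int p(x_1 \ldots x_m) \Bigl( \int p(x_{m+1} | x_1 \ldots x_m) \log \frac{p(x_{m+1} | x_1 \ldots x_m)}{r_U(x_{m+1} | x_1 \ldots x_m)} \, d\lambda \Bigr) d\lambda_m \Bigr). \end{aligned} \tag{40} \]

Taking into account that the last term is equal to $\frac{\mathrm{const}}{t} E\bigl( \log \frac{p(x_1 \ldots x_t)}{r_U(x_1 \ldots x_t)} \bigr)$, from
(39), (40) and Corollary 1 we obtain (38). ii) can be derived from i) and the
Jensen inequality for the function $x^2$. The theorem is proven.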

We have seen that, in a certain sense, the estimate $r_U$ approximates the density
$p$. The following theorem shows that $r_U$ can be used instead of $p$ for estimating
average values of certain functions.

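Theorem 8. Let $f$ be an integrable function whose absolute value is bounded by
a certain constant $M$. Then the following equalities are valid:

\[ \text{i)} \quad \lim_{t \to \infty} \frac{1}{t} E\Bigl( \sum_{m=0}^{t-1} \Bigl( \int f(x) p(x | x_1 \ldots x_m) \, d\lambda_m - \int f(x) r_U(x | x_1 \ldots x_m) \, d\lambda_m \Bigr)^2 \Bigr) = 0, \tag{41} \]

\[ \text{ii)} \quad \lim_{t \to \infty} \frac{1}{t} E\Bigl( \sum_{m=0}^{t-1} \Bigl| \int f(x) p(x | x_1 \ldots x_m) \, d\lambda_m - \int f(x) r_U(x | x_1 \ldots x_m) \, d\lambda_m \Bigr| \Bigr) = 0. \]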

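Proof. The last inequality in the following chain follows from Pinsker's
inequality, whereas all the others are obvious.

\[ \begin{aligned} & \Bigl( \int f(x) p(x | x_1 \ldots x_m) \, d\lambda_m - \int f(x) r_U(x | x_1 \ldots x_m) \, d\lambda_m \Bigr)^2 = \\ & \Bigl( \int f(x) ( p(x | x_1 \ldots x_m) - r_U(x | x_1 \ldots x_m) ) \, d\lambda_m \Bigr)^2 \le \\ & M^2 \Bigl( \int | p(x | x_1 \ldots x_m) - r_U(x | x_1 \ldots x_m) | \, d\lambda_m \Bigr)^2 \le \\ & \mathrm{const} \int p(x | x_1 \ldots x_m) \log ( p(x | x_1 \ldots x_m) / r_U(x | x_1 \ldots x_m) ) \, d\lambda_m. \end{aligned} \]

From these inequalities we obtain:

\[ \begin{aligned} & \sum_{m=0}^{t-1} E\Bigl( \Bigl( \int f(x) p(x | x_1 \ldots x_m) \, d\lambda_m - \int f(x) r_U(x | x_1 \ldots x_m) \, d\lambda_m \Bigr)^2 \Bigr) \le \\ & \sum_{m=0}^{t-1} \mathrm{const} \, E\Bigl( \int p(x | x_1 \ldots x_m) \log ( p(x | x_1 \ldots x_m) / r_U(x | x_1 \ldots x_m) ) \, d\lambda_m \Bigr). \end{aligned} \tag{42} \]

The last term can be presented as follows:

\[ \begin{aligned} & \sum_{m=0}^{t-1} E\Bigl( \int p(x | x_1 \ldots x_m) \log ( p(x | x_1 \ldots x_m) / r_U(x | x_1 \ldots x_m) ) \, d\lambda_m \Bigr) = \\ & \sum_{m=0}^{t-1} \int p(x_1 \ldots x_m) \int p(x | x_1 \ldots x_m) \log ( p(x | x_1 \ldots x_m) / r_U(x | x_1 \ldots x_m) ) \, d\lambda \, d\lambda_m = \\ & \int p(x_1 \ldots x_t) \log ( p(x_1 \ldots x_t) / r_U(x_1 \ldots x_t) ) \, d\lambda_t. \end{aligned} \]

From this equality, (42) and Corollary 1 we obtain (41). ii) can be derived from i)
and the Jensen inequality for $x^2$. The theorem is proven.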
5 The Experiments



In this section we describe some experiments and a simulation study carried
out in order to evaluate the efficiency of the suggested algorithms, with the main
attention paid to the prediction problem. The results show that, in general, the
described approach can be used in applications.

5.1 Simulations 

We constructed several artificial samples created by processes with known structure
and tried to predict the next value ($x_{n+1}$) of the process based on $x_1, \ldots, x_n$. The WinRAR
archiver (http://www.rarlab.com) was chosen as the code for constructing predictors.
The scheme of the experiments is as follows. Let $x_1 \ldots x_n$ be the generated sequence.
Denote by $x^*$ the estimate of $x_{n+1}$. For each $n$ we calculate the density $r_U(x_1 \ldots x_n)$
and the average value (according to this density), which is output as the predicted
value $x^*$.
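The prediction scheme just described can be sketched as follows. This is only an illustration under stated assumptions: `zlib` is used in place of WinRAR, the values are quantized onto a fixed uniform grid of `levels` cells on a known interval `[lo, hi)`, and the predictive weight of each candidate cell is taken proportional to $2^{-|U(x_1 \ldots x_n a)|}$. The function name and all parameters are hypothetical.

```python
import zlib

def predict_next(xs, levels=16, lo=-1.0, hi=1.0):
    """Predict x_{n+1} as the mean of a compression-based predictive distribution."""
    width = (hi - lo) / levels

    def cell(x):
        # Index of the grid cell containing x, clamped to the valid range.
        return min(max(int((x - lo) / width), 0), levels - 1)

    history = bytes(cell(x) for x in xs)
    # Code length in bits of the history extended by each candidate next cell.
    lengths = [8 * len(zlib.compress(history + bytes([a]), 9)) for a in range(levels)]
    base = min(lengths)  # subtract the minimum so the powers of two do not underflow
    weights = [2.0 ** -(length - base) for length in lengths]
    total = sum(weights)
    centers = [lo + (a + 0.5) * width for a in range(levels)]
    # Mean of the cell centers under the normalised weights.
    return sum(w * c for w, c in zip(weights, centers)) / total
```

Because `zlib` reports lengths in whole bytes, the weights produced here are coarse; a compressor with finer length resolution, or an arithmetic-coding based estimate, would give smoother predictive distributions.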

The first process was created according to the following formula: $x_i = \sin(\pi i / 23)$.
In this experiment we used WinRAR with the medium quality of compression.
After every experiment the error of the prediction $r_j = |x^* - x_{n+1}|$ was evaluated.
We compared these values with the errors of the so-called inertial predictor, where the
estimate of the (unknown) $x_{n+1}$ is defined ). The obtained results
are given in the following table.



Number of experiments   Length of a sample sequence (n)   Mean error (suggested)   Mean error (inertial)
100                     1000                              0.37                     0.41
100                     2000                              0.37                     0.46
100                     3000                              0.34                     0.45



The numbers given in the first line of the table mean that 100 experiments were
carried out, the length of the observed data is equal to 1000 ($n = 1000$), the mean
value of the error ($\sum_{i=1}^{100} r_i / 100$) of prediction using the suggested method is 0.37,
whereas the mean error of inertial prediction is 0.41.
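The evaluation loop behind such a table can be sketched as below. The sketch is not intended to reproduce the reported numbers, since the exact setup (how the 100 samples were drawn, any scaling or rounding of the values) is not specified. Two things are assumptions of this illustration: the inertial rule is taken to be "the next value repeats the last observed one", and the 100 experiments are taken to be consecutive sliding windows of one long realization.

```python
import math

def sine_process(n):
    """x_i = sin(pi * i / 23) for i = 1..n."""
    return [math.sin(math.pi * i / 23) for i in range(1, n + 1)]

def inertial_prediction(xs):
    """Assumed inertial rule: predict that the next value repeats the last one."""
    return xs[-1]

def mean_abs_error(predict, series, n, trials):
    """Average |x* - x_{n+1}| over `trials` sliding windows of length n."""
    errors = []
    for j in range(trials):
        window = series[j:j + n]   # observed data x_1..x_n of the j-th experiment
        target = series[j + n]     # the value to be predicted
        errors.append(abs(predict(window) - target))
    return sum(errors) / len(errors)

series = sine_process(1000 + 100)
err = mean_abs_error(inertial_prediction, series, n=1000, trials=100)
```

Any other predictor with the same signature, for example a compression-based one, can be passed as `predict` to obtain a comparable mean error.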

The second process was a "random mixture" of the four following functions: $f_1(i) =
[5 \sin(\pi i / 16)]$, $f_2(i) = [7 \sin(\pi i / + \pi/5)]$, $f_3(i) = [8 \sin(\pi i / 3)]$, $f_4(i) =
[8 \sin(\pi i / 23)]$. More precisely, first the length of a segment was randomly chosen
according to the Poisson distribution (with parameter $\lambda = 0.1$), then the function
on each segment was chosen randomly (with probability 1/4 each) and the values of the
segment were generated according to the chosen formula. The results of this experiment
are given in the table below.
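A generator for such a random mixture might look like the sketch below. Several details are assumptions: the divisor inside $f_2$ is missing from the text, so `F2_DIVISOR` is a purely hypothetical placeholder; the square brackets are read as the integer part (floor); and the index $i$ is taken to be the global position in the sequence rather than restarting on every segment.

```python
import math
import random

def poisson(lam, rng):
    """Sample from a Poisson(lam) distribution (Knuth's multiplication method)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

F2_DIVISOR = 10  # hypothetical: the divisor in f_2 is missing from the text

FUNCTIONS = [
    lambda i: math.floor(5 * math.sin(math.pi * i / 16)),
    lambda i: math.floor(7 * math.sin(math.pi * i / F2_DIVISOR + math.pi / 5)),
    lambda i: math.floor(8 * math.sin(math.pi * i / 3)),
    lambda i: math.floor(8 * math.sin(math.pi * i / 23)),
]

def random_mixture(n, lam=0.1, seed=0):
    """Generate n values; each segment has Poisson(lam) length and a random function."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        length = poisson(lam, rng)   # may be 0, in which case the segment is empty
        f = rng.choice(FUNCTIONS)    # each function is chosen with probability 1/4
        start = len(out)
        out.extend(f(i) for i in range(start, start + length))
    return out[:n]
```

With $\lambda = 0.1$ most drawn segment lengths are zero and the non-empty segments are very short, so the generated sequence switches between the four functions almost at every step.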



Number of experiments   Length of a sample sequence (n)   Mean error (suggested)   Mean error (inertial)
100                     2000                              1.43                     2.2
100                     5000                              2.97                     4.27
100                     10000                             3.07                     3.4
5.2 Prediction of currency rate



To carry out these experiments we took EURO/USD values from Forex
(http://www.forex.com). The scheme of the experiments is mainly the same as in the
previous section. In these experiments we used the WinRAR data compression method
and the predictor R. First of all we carried out a few experiments to find the best
parameters for prediction. R showed better results than the WinRAR archiver. We took
independent samples and carried out the experiments as described above. The results
are given in the table below.



Number of experiments   Length of a sample sequence (n)   Mean error (suggested)   Mean error (inertial)
100                     600                               0.0150                   0.0175
100                     600                               0.0143                   0.0165
100                     600                               0.0131                   0.0162
100                     600                               0.0164                   0.0175



So, we can see that predictors which are based on data compression methods have 
reasonable performance in practice. 



References 

[1] Algoet P. Universal Schemes for Learning the Best Nonlinear Predictor Given
the Infinite Past and Side Information. IEEE Trans. Inform. Theory, v. 45, pp.
1165-1185, 1999.

[2] Barron A.R. The strong ergodic theorem for densities: generalized
Shannon-McMillan-Breiman theorem. The Annals of Probability, v. 13, n. 4, pp. 1292-1303,
1985.

[3] Billingsley P., Ergodic theory and information, 1965. John Wiley & Sons. 

[4] Cilibrasi R., Vitanyi P.M.B. Clustering by Compression. IEEE Transactions on 
Information Theory, v. 51, n.4. 2005. 

[5] Cilibrasi R., de Wolf R., and Vitanyi P.M.B. Algorithmic Clustering of Music. 
Computer Music Journal, v. 28, n. 4, pp. 49-67, 2004. 

[6] Csiszár I., Körner J. Information Theory: Coding Theorems for Discrete
Memoryless Systems. Budapest, Akadémiai Kiadó, 1981.

[7] Csiszár I., Shields P., 2000. The consistency of the BIC Markov order estimation.
Annals of Statistics, v. 6, pp. 1601-1619.

[8] Darbellay G.A., Vajda I., 1998. Entropy expressions for multivariate continuous
distributions. Research Report no 1920, UTIA, Academy of Science, Prague
(library@utia.cas.cz).






[9] Darbellay G.A., Vajda I., 1999. Estimation of the mutual information with
data-dependent partitions. IEEE Trans. Inform. Theory. 48(5), 1061-1081.

[10] Effros, M., Visweswariah, K., Kulkarni, S.R., Verdu, S., Universal lossless source 
coding with the Burrows Wheeler transform. IEEE Trans. Inform. Theory. 45, 
1315-1321. 

[11] Feller W., 1970. An Introduction to Probability Theory and Its Applications,
vol. 1. John Wiley & Sons, New York.

[12] Fitingof B.M. Optimal encoding for unknown and changing statistics of
messages. Problems of Information Transmission, v. 2, n. 2, pp. 3-11, 1966.

[13] Gallager R.G., 1968. Information Theory and Reliable Communication. John
Wiley & Sons, New York, 1968.

[14] Gyorfi L., Morvai G., Yakowitz S.J. Limits to consistent on-line forecasting for
ergodic time series. IEEE Transactions on Information Theory, v. 44, n. 2, pp.
886-892, 1998.

[15] Jacquet P., Szpankowski W., Apostol L. Universal predictor based on pattern
matching. IEEE Trans. Inform. Theory, v. 48, pp. 1462-1472, 2002.

[16] Kelly J.L. A new interpretation of information rate. Bell System Tech. J., v. 35,
pp. 917-926, 1956.

[17] Kieffer J., 1998. Prediction and Information Theory, Preprint (available at
ftp://oz.ee.umn.edu/users/kieffer/papers/prediction.pdf/).

[18] Kieffer, J.C., En-Hui Yang, 2000. Grammar-based codes: a new class of universal 
lossless source codes. IEEE Transactions on Information Theory, 46 (3), 737 - 754. 

[19] Knuth D.E. The art of computer programming. Vol.2. Addison Wesley, 1981. 

[20] Kolmogorov A.N. Three approaches to the quantitative definition of information.
Problems Inform. Transmission, v. 1, 1965, pp. 3-11.

[21] Krichevsky R. A relation between the plausibility of information about a source
and encoding redundancy. Problems Inform. Transmission, v. 4, n. 3, 1968, pp.
48-57.

[22] Krichevsky R. Universal Compression and Retrieval. Kluwer Academic Publishers,
1993.

[23] Kullback S. Information Theory and Statistics. Wiley, New York, 1959.

[24] Modha D.S., Masry E. Memory-universal prediction of stationary random
processes. IEEE Trans. Inform. Theory, 44, n. 1, 117-133.



[25] Morvai G., Yakowitz S.J., Algoet P.H., 1997. Weakly convergent nonparametric
forecasting of stationary time series. IEEE Trans. Inform. Theory, 43, 483-498.

[26] Nobel A.B., 2003. On optimal sequential prediction. IEEE Trans. Inform. Theory,
49(1), 83-98.

[27] Rissanen J., 1984. Universal coding, information, prediction, and estimation.
IEEE Trans. Inform. Theory, 30(4), 629-636.

[28] Rukhin A. and others. A statistical test suite for random and pseudorandom
number generators for cryptographic applications. NIST Special Publication
800-22 (with revision dated May 15, 2001). http://csrc.nist.gov/rng/SP800-22b.pdf

[29] Ryabko B.Ya., 1984. Twice-universal coding. Problems of Information
Transmission, 20(3), 173-177.

[30] Ryabko B.Ya., 1988. Prediction of random sequences and universal coding.
Problems of Inform. Transmission, 24(2), 87-96.

[31] Ryabko B.Ya., 1990. A fast adaptive coding algorithm. Problems of Inform.
Transmission, 26(4), 305-317.

[32] Ryabko B.Ya. The complexity and effectiveness of prediction algorithms. J.
Complexity, 10 (1994), no. 3, 281-295.

[33] Ryabko B., Astola J. Universal Codes as a Basis for Time Series Testing.
Statistical Methodology, v. 3, pp. 375-397, 2006.

[34] Ryabko B.Ya., Monarev V.A. Using information theory approach to randomness
testing. Journal of Statistical Planning and Inference, 2005, v. 133, n. 1, pp. 95-110.

[35] Ryabko B., Topsoe F., 2002. On Asymptotically Optimal Methods of Prediction
and Adaptive Coding for Markov Sources. Journal of Complexity, 18(1), 224-241.

[36] Ryabko D., Hutter M. Sequence prediction for non-stationary processes. In
proceedings: Combinatorial and Algorithmic Foundations of Pattern and Association
Discovery. Dagstuhl Seminar 2006, Germany, http://www.dagstuhl.de/06201/ see
also http://arxiv.org/pdf/cs.LG/0606077

[37] Savari S.A., 2000. A probabilistic approach to some asymptotics in noiseless
communication. IEEE Transactions on Information Theory, 46(4), 1246-1262.

[38] Shannon C.E. A mathematical theory of communication. Bell Sys. Tech. J.,
vol. 27, pp. 379-423 and pp. 623-656, 1948.

[39] Shannon C.E. Communication theory of secrecy systems. Bell Sys. Tech. J., vol.
28, pp. 656-715, 1949.

[40] Shields P.C. The interactions between ergodic theory and information theory.
IEEE Transactions on Information Theory, v. 44, n. 6, 1998, pp. 2079-2093.






[41] Szpankowski W. Average case analysis of algorithms on sequences. John Wiley
and Sons, New York, 2001.



